Skip to content

PLU-347: feat(box): add ACL permissions metadata to Box connector#704

Merged
awalker4 merged 17 commits into
mainfrom
feature/box-acl-permissions
May 13, 2026
Merged

PLU-347: feat(box): add ACL permissions metadata to Box connector#704
awalker4 merged 17 commits into
mainfrom
feature/box-acl-permissions

Conversation

@danielle-unstructured-io
Copy link
Copy Markdown
Contributor

@danielle-unstructured-io danielle-unstructured-io commented May 5, 2026

Summary

  • Extends the Box connector to populate permissions_data on FileDataSourceMetadata, consistent with the Confluence and Google Drive implementations
  • Permissions are fetched at index time in BoxIndexer.run() so they're available to all downstream pipeline stages
  • BoxDownloader.run() retains a fallback for standalone usage (CLI, integration tests without the SND plugin layer)

What changed

unstructured_ingest/processes/connectors/fsspec/box.py

  • Added BOX_ROLE_MAPPING — maps Box collaboration roles to [read], [read, update], or [read, update, delete]; uploader excluded (write-only)
  • Added module-level _normalize_collaborations(), _get_collaborations_for_folder(), _get_permissions_for_file() helpers
  • BoxIndexer.run() override — initializes a Box SDK client once, then for each indexed file walks path_collection ancestor folders (LRU-cached, max 5 entries) plus direct file collaborations to build normalized permissions
  • BoxDownloader.run() fallback — only fetches permissions if permissions_data is None (i.e., indexer wasn't run)
  • BoxConnectionConfig.get_box_client() — returns an authenticated boxsdk.Client via JWT

Tests

  • 21 unit tests covering BOX_ROLE_MAPPING, _normalize_collaborations, and _get_permissions_for_file (all mock-based)
  • Integration test + expected-results fixtures for both top-folder and second-tier subfolder files

Test plan

  • All 21 unit tests pass
  • Verified end-to-end in SND: root-level and subfolder Box files both have permissions_data populated with correct user IDs
  • Integration tests pass in CI

Closes PLU-347

🤖 Generated with Claude Code


Note

Medium Risk
Adds new Box API calls during indexing/downloading to derive and attach normalized ACL metadata, which can impact performance and correctness of permission propagation if edge cases are missed. Risk is mitigated by caps, caching, and extensive unit/integration test coverage.

Overview
Adds ACL permission metadata passthrough to the Box connector by deriving metadata.permissions_data from Box collaborations and normalizing them into read/update/delete user/group ID lists.

Permissions are fetched primarily at index time via an ancestor-folder collaboration walk with a small LRU cache and a configurable cap (BoxIndexerConfig.max_num_metadata_permissions), with a downloader-side fallback for standalone usage.

Updates/extends tests: new Box unit tests for role mapping and permission normalization, new Box integration tests plus expected fixtures, and a test harness fix to strip randomized unstructured_<tmp>/ path segments so fixture directory structures are stable. Docs images paths and package version/changelog are also updated for the 1.6.0 release.

Reviewed by Cursor Bugbot for commit 312f7c5. Bugbot is set up for automated code reviews on this repo. Configure here.

danielle-unstructured-io and others added 2 commits May 1, 2026 14:42
Fetches collaborations from Box API (direct + inherited via path_collection
ancestor walk) and normalizes into permissions_data on FileDataSourceMetadata,
consistent with Confluence and Google Drive connectors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread unstructured_ingest/processes/connectors/fsspec/box.py Outdated
Comment thread unstructured_ingest/processes/connectors/fsspec/box.py
Comment thread unstructured_ingest/processes/connectors/fsspec/box.py
Copy link
Copy Markdown
Contributor

@awalker4 awalker4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will just need to add a new block to the changelog for whatever the next version is. This value also goes into __version__.py

Comment thread unstructured_ingest/processes/connectors/fsspec/box.py
danielle-unstructured-io and others added 5 commits May 11, 2026 10:32
The cap lived on BoxDownloaderConfig but permissions are now fetched
during indexing, so the user's configured value was ignored in normal
pipelines (the downloader fallback only fires for standalone use).
Moved the field to BoxIndexerConfig where the work happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Access-only collabs are Box's mechanism for hidden, scoped grants
(e.g. shared-link backing). They do not cascade like normal collabs,
so picking them up during ancestor-folder walks added users to
sibling files' permissions_data, overgranting downstream ACLs.

Skip everywhere (folder walk and direct file collabs) for consistency
with Box's own access-list semantics, which deliberately hide these
grants. Explicitly request is_access_only via fields= on both collab
calls so the guard fires reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Box Editors can delete files and folders (view, download, upload,
edit, delete, copy, move, rename, share is the full editor capability
set). BOX_ROLE_MAPPING omitted delete, so editor users and groups were
missing from permissions_data["delete"] — an under-grant on delete
access for valid Box collaborators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FsspecDownloader writes each file into a fresh tempfile.mkdtemp
("unstructured_<random>") subdir to avoid path collisions. The
captured directory_structure.json fixtures included that random
suffix, so directory-structure validation would fail on every run
even when downloads succeeded.

Strip the unstructured_<random>/ segment in get_files() so fixtures
capture the logical structure, and regenerate the Box fixtures with
clean paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump for the Box ACL permissions feature, matching the
precedent set by the SharePoint ACL pass-through (1.4.29 -> 1.5.0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread unstructured_ingest/processes/connectors/fsspec/box.py
awalker4 and others added 2 commits May 11, 2026 17:09
…ssions

# Conflicts:
#	CHANGELOG.md
#	unstructured_ingest/__version__.py
ACL reading is a substantial enough feature to warrant a minor bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread unstructured_ingest/processes/connectors/fsspec/box.py Outdated
Comment thread test/integration/connectors/utils/validation/source.py
A transient 503 or timeout against `client.folder(id).get_collaborations`
was being swallowed and an empty list cached for the rest of the indexer
run, silently zeroing out permissions for every descendant file sharing
that ancestor. Return without writing the cache on exception so the next
file retries, and bump the log to warning so the failure is visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
get_files() strips the unstructured_<random>/ tempdir prefix for fixture
comparison, but the actual downloaded files still live under that subdir.
Reusing the stripped paths for filesystem lookups would FileNotFoundError
for any future fsspec connector test with validate_downloaded_files=True.
Walk the real tree, then strip the prefix only when resolving the expected
fixture path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Box previewers can render content in the web UI but cannot download files,
so mapping them to ["read"] would overstate access in downstream ACL
consumers that interpret read as download capability. Mirrors the
existing exclusion of "uploader" (write-only, can't view).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@awalker4
Copy link
Copy Markdown
Contributor

@claude please review. Be thorough when looking at the ACL properties as we want to make sure this is correct

@claude
Copy link
Copy Markdown

claude Bot commented May 12, 2026

Claude finished @awalker4's task —— View job


Box ACL Permissions Review

I've conducted a thorough review of the ACL permissions implementation. Here are my findings:

Todo List:

  • Read and analyze the main Box connector implementation
  • Review the role mapping logic (BOX_ROLE_MAPPING)
  • Examine permission normalization functions
  • Check test coverage and expected behaviors
  • Address Cursor Bugbot findings
  • Provide comprehensive review feedback

Critical Issues Found

🚨 HIGH SEVERITY: Editor Role Missing Delete Permission

Issue: Line 41 in unstructured_ingest/processes/connectors/fsspec/box.py maps editor to ["read", "update", "delete"], but this contradicts Cursor Bugbot's finding that editors should have delete permissions. Looking closer, the code actually does grant delete to editors - the Bugbot comment appears to be incorrect in this case.

However, there's a legitimate concern about previewer roles:

🚨 MEDIUM SEVERITY: Previewer Roles Overgrant Permissions

Issue: The current implementation correctly excludes previewer roles from BOX_ROLE_MAPPING (lines 44-47), but the comment explains this well. Box previewers can only render content in the web UI but cannot download files, so granting "read" would overstate their actual permissions to downstream ACL consumers.

Status: ✅ CORRECTLY HANDLED - The implementation properly excludes previewers.

🚨 HIGH SEVERITY: Access-Only Collaborations Properly Filtered

Issue: Lines 64-67 correctly filter out is_access_only collaborations to prevent overgranting permissions. This is well-implemented.

Status: ✅ CORRECTLY HANDLED

🚨 MEDIUM SEVERITY: API Error Caching Issue

Issue: In _get_collaborations_for_folder() (lines 98-103), transient API errors don't get cached, which is good. However, there's a potential issue where persistent failures could cause repeated API calls without any exponential backoff.

Status: ⚠️ MINOR CONCERN - Consider implementing exponential backoff for persistent failures.

🚨 LOW SEVERITY: JWT Client Recreation in Downloader

Issue: Lines 309-310 show the downloader properly memoizes the Box client with self._box_client to avoid repeated JWT authentication. This addresses the Bugbot concern.

Status: ✅ CORRECTLY HANDLED

Positive Implementation Details

Role Mapping Logic

The BOX_ROLE_MAPPING (lines 38-48) is well-designed:

  • Owner/co-owner get full permissions ✓
  • Editor gets full permissions (contrary to Bugbot comment) ✓
  • Viewers get read-only ✓
  • Uploaders are excluded (write-only, no read) ✓
  • Previewers excluded (no download capability) ✓

Permission Normalization

The _normalize_collaborations() function properly:

  • Filters out non-accepted collaborations ✓
  • Excludes access-only collaborations ✓
  • Skips all_users_group ✓
  • Respects permission caps ✓
  • Handles missing/invalid data gracefully ✓

Caching Strategy

  • LRU cache for folder collaborations (5 max entries) ✓
  • Transient errors don't get cached ✓
  • JWT client memoization in downloader ✓

Test Coverage

The unit tests (test/unit/connectors/test_box.py) provide excellent coverage:

  • Role mapping validation ✓
  • Collaboration normalization edge cases ✓
  • Caching behavior ✓
  • Error handling ✓

Recommendations

  1. Performance: Consider the has_collaborations optimization mentioned in Bugbot - line 127 fetches this field but doesn't use it to skip unnecessary API calls.

  2. Error Handling: The current error handling is good, but consider adding exponential backoff for repeated folder collaboration failures.

  3. Configuration: The permission cap is properly implemented in the indexer but consider documenting the default limit of 500.

Conclusion

The ACL implementation is fundamentally sound and secure. The core permission logic correctly maps Box roles to normalized permissions and properly filters out problematic collaboration types. Most of the Cursor Bugbot findings appear to be false positives or have already been addressed in the current code.

The implementation follows security best practices by:

  • Not overgranting permissions to preview-only roles
  • Filtering access-only collaborations
  • Properly handling API failures without data corruption
  • Implementing reasonable permission caps

Overall Assessment: ✅ APPROVED with minor optimization opportunities noted above.


logger.error(f"unhandled exception from box ({type(e)}): {e}", exc_info=True)
return e

@requires_dependencies(["boxsdk"], extras="box")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boxsdk is not listed in the box extras in pyproject.toml (box = ["boxfs", "fsspec"]). This decorator tells users to pip install unstructured-ingest[box], but that won't install boxsdk. Confirmed boxfs does not pull in boxsdk as a transitive dependency either.

The ACL feature silently degrades (the except at line 268 catches the import failure and logs a warning, yielding None permissions). One-line fix: add "boxsdk" to the extras in pyproject.toml.

try:
if self._box_client is None:
self._box_client = self.connection_config.get_box_client()
file_data.metadata.permissions_data = _get_permissions_for_file(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downloader fallback calls _get_permissions_for_file without passing max_perms, falling through to the default of 500. The indexer at line 280 passes self.index_config.max_num_metadata_permissions. If a user configures a non-default cap (say 50), the indexer respects it but the standalone downloader path silently ignores it.

Confluence avoids this by putting max_num_metadata_permissions on ConfluenceDownloaderConfig (confluence.py:249). Worth mirroring that pattern on BoxDownloaderConfig if the standalone path is a supported use case.

logger.warning(f"Could not retrieve collaborations for folder {folder_id}: {e}")
return collabs

if len(cache) >= 5:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache capacity is hardcoded to 5. Confluence makes this configurable via _permissions_cache_max_size: int = 5 (confluence.py:277). For Box, the cache stores ancestor folder collaborations, and folder trees with more than 5 sibling folders at any level will thrash (evict entries needed by the next file in the same branch). Worth exposing on BoxIndexerConfig alongside max_num_metadata_permissions, with a higher default (e.g. 128) since each entry is a small list of dicts.

# Default True so a failed file.get() doesn't suppress the direct-collab fetch.
has_direct_collabs = True
try:
file_obj = client.file(file_id).get(fields=["path_collection", "has_collaborations"])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every file calls client.file(file_id).get(fields=["path_collection", "has_collaborations"]) to discover its ancestor chain, but files sharing a parent folder return identical path_collection data. For 1,000 files across 50 folders, that's ~950 redundant calls.

A follow-up optimization: cache the path-to-ancestors mapping per parent folder via folder.get(fields=["path_collection"]) once per unique parent, then reuse for all files within it. Not blocking for the initial ship, but worth tracking since Box rate-limits at 1,000 calls/minute for Enterprise accounts.

…aches, path_collection dedup)

- Add boxsdk to the box extras so `pip install unstructured-ingest[box]` actually installs what `@requires_dependencies` advertises; without it the ACL feature silently degraded.
- Expose `max_num_metadata_permissions` and `permissions_cache_max_size` on both `BoxIndexerConfig` and `BoxDownloaderConfig`; the downloader fallback path now respects user-configured caps instead of hardcoded defaults.
- Replace the hardcoded 5-entry ancestor-folder cache with a configurable LRU (default 128) so wide folder trees stop thrashing across sibling files.
- Cache `path_collection` ancestor IDs by parent path so only the first file in a given parent pays the `file.get()` round-trip; siblings reuse the cached chain. `has_collaborations` short-circuit is preserved on the first-file path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfigs

CLI consolidates options with the same name across indexer/downloader configs and rejects them if any attribute (including help text) differs. The longer indexer-side description tripped this in test_ingest_help; collapse both to the shorter shared form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@brandon-u10d brandon-u10d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four prior comments verified addressed.

and parent_path is not None
and parent_path in parent_chain_cache
):
ancestor_folder_ids = parent_chain_cache[parent_path]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent_chain_cache hit skips file.get() (good), but also skips reading has_collaborations, so has_direct_collabs stays at its default True. Sibling files in folders without direct collaborations make one unnecessary get_collaborations() call each.

Non-blocking; matches every other connector's unconditional-fetch pattern. Small follow-up optimization:

Suggested change
ancestor_folder_ids = parent_chain_cache[parent_path]
ancestor_folder_ids, has_direct_collabs = parent_chain_cache[parent_path]

Corresponding write at line 155: parent_chain_cache[parent_path] = (ancestor_folder_ids, has_direct_collabs)

@awalker4 awalker4 merged commit 29ca4a4 into main May 13, 2026
36 of 37 checks passed
@awalker4 awalker4 deleted the feature/box-acl-permissions branch May 13, 2026 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants